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Abstract 

In recent years a number of methods have 
been developed for automatically learning the 
(sparse) connectivity structure of Markov Ran- 
dom Fields. These methods are mostly based on 
Li-regularized optimization which has a num- 
ber of disadvantages such as the inability to 
assess model uncertainty and expensive cross- 
validation to find the optimal regularization pa- 
rameter. Moreover, the model's predictive per- 
formance may degrade dramatically with a sub- 
optimal value of the regularization parameter 
(which is sometimes desirable to induce sparse- 
ness). We propose a fully Bayesian approach 
based on a "spike and slab" prior (similar to 
L regularization) that does not suffer from 
these shortcomings. We develop an approxi- 
mate MCMC method combining Langevin dy- 
namics and reversible jump MCMC to conduct 
inference in this model. Experiments show that 
the proposed model learns a good combination 
of the structure and parameter values without 
the need for separate hyper-parameter tuning. 
Moreover, the model's predictive performance 
is much more robust than ii-based methods 
with hyper-parameter settings that induce highly 
sparse model structures. 



1 Introduction 

Undirected probabilistic graphical models, also known as 
Markov Random Fields (MRFs), have been widely used 
in a large variety of domains including computer vision 



dO E009 ), natural language processing ( Sha and Pereira 



2003 ), and social networks (Robins et aL) |2007 1. The struc- 
ture of the model is defined through a set of features defined 
on subsets of random variables. Automated methods to se- 
lect relevant features are becoming increasingly important 
in a time where the proliferation of sensors make it possi- 



ble to measure a multitude of data-attributes. There is also 
an increasing interest in sparse model structures because 
they help against overfitting and are computationally more 
tractable than dense model structures. 

In this paper we focus on a particular type of MRF, called 
a log-linear model, where structure learning or feature se- 
lection is integrated with parameter estimation. Although 
structure learning has been extensively studied for directed 
graphical models, it is typically more difficult for undi- 
rected models due to the intractable normalization term of 
the probability distribution, known as the partition func- 
tion. Traditional algorithms apply only to restricted types 
of structures with low tree- width ( And rew and G ao 2007, 
Tsuruoka et al. 2009 Hu et al. 2009|) or speci al models 



such as Gaussian graphical models ( Jones et al. 2005 1 so 
that accurate inference can be conducted efficiently. 

For an arbitrary structure, various methods have been pro- 
posed in the literature, generally categorized into two ap- 
proaches. One approach is based on separate tests on an 
edge or the neighbourhood of a node so that there is no need 
to compute the join t distribution (fWainwright et al.| |2007| 



Bresler et al. 2008 Ravikumar et al. 20 IP} . The other ap- 
proach is based on maximum likelihood estimation (MLE) 
with a sparsity inducing criterion. These methods require 
approximate inference algorithms in order to estimate the 



log-likelihood such as Gibbs sampling ( Delia Pietr 
19971), loopy belief propagation (|Lee et 



et al. 



2010 1, or pseudo-likelihood (|Hofling and Tibshirani 



aL]j2006, 



a et al.| 
)6[ |ZIui| 



2009). A popular choice of such a criterion is L\ regular- 
ization ( |Riezler and Vassermanl |2004[ |Dudik et al] [2004 ) 
which enjoys several good properties such as a convex ob- 
jective function and a consistency guarantee. However, L\- 
regularized MLE is usually sensitive to the choice the reg- 
ularization strength, and these optimization-based methods 
cannot provide a credible interval for the learned structure. 
Also, in order to leam a sparse structure, a strong penalty 
has to be imposed on all the edges which usually results in 
suboptimal parameter values. 

We will follow a third approach to MRF structure learn- 
ing in a fully Bayesian framework which has not been ex- 



plored yet. The Bayesian approach considers the structure 
of a graphical model as random. Inference in a Bayesian 
model provides inherent regularization, and offers a fully 
probabilistic characterization of the underlying structure. 
It was shown in [Park and Casefla (2008) that Bayesian 
models with a Gaussian or Laplace prior distribution (cor- 
responding to L2 or L\ regularization) do not exhibit 
sparse structure. Moha med et al.| ( |201 \\ proposes to use a 
"spike and slab" prior for learning directed graphical mod- 
els which corresponds to the ideal L regularization. This 
model exhibits better robustness against over-fitting than 
the related L\ approaches. Unlike the Laplace/Gaussian 
prior, the posterior distribution over parameters for a "spike 
and slab" prior is no longer guaranteed to be unimodal. 
However, approximate inference methods have been suc- 
cessfully applied in the context of directed models using 
MCMC (Mohamed et al. 201 l|l and expectation propaga- 
( |Herna 



MRFs with log-linear parametrization: 



tion (Hernandez-Lobato et al. , 2010). 



Unfortunately, Bayesian inference for MRFs is much 
harder than for directed networks due to the partition 
function. This feature renders even MCMC sampling in- 
tractable which caused some people to dub these prob- 
lems "double intractability" (Murray et al. 2006 1. Nev- 



ertheless, variational methods (Parise and Welling 2006 



Qi et al. 2005 ) and MCMC methods (|Murray and Ghahra 



mam 



2004 ) have been successfully explored for approxi- 



mate inference when the model structure is fixed. 

We propose a Bayesian structure learning method with a 
spike and slab prior for MRFs and devise an approximate 
MCMC method to draw samples of both the model struc- 
ture and the model parameters by combining a modified 
Langevin sampler with a reversible jump MCMC method. 
Experiments show that the posterior distribution estimated 
by our inference method matches the actual distribution 
very well. Moreover, our method offers better robustness 
to both under- fitting and over- fitting than L x -regularized 
methods. A related but different application of the spike 
and slab distribution in MRFs is shown in ICourville et al.l 
( |201 \\ for modelling hidden random variables. 

This paper is organized as follows: we first introduce a hi- 
erarchical Bayesian model for MRF structure learning in 
section|2]and then describe an approximate MCMC method 
in section|3][4j and[3]to draw samples for the model param- 
eters, structure, and other hyper-parameters respectively. 
Experiments are conducted in section [6] on two simulated 
data sets and a real-world dataset, followed by a discussion 
section. 



2 Learning the Structure of MRFs as 
Bayesian Inference 

The probability distribution of a MRF is defined by a set 
of potential functions. Consider a widely used family of 
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(1) 



where each potential function is defined as the exponential 
of the product between a feature function f a of a set of 
variables x Q and an associated parameter 8 a . Z is called 
the partition function. All the variables in the scope of a 
potential function form a clique in their graphical represen- 
tation. When a parameter 8 a has a value of zero, we could 
equivalently remove feature f a and all the edges between 
variables in x Q (if these variables are not also in the scope 
of other features) without changing the distribution of x. 
Therefore, by learning the parameters of this MRF model 
we can simultaneously learn the structure of a model if we 
allow some parameters to go to zero. 

The Bayesian learning approach to graphical models con- 
siders parameters as a random variable subject to a prior. 
Given observed data, we can infer the posterior distribu- 
tion of the parameters and their connectivity structure. Two 
commonly used priors, the Laplace and the Gaussian distri- 
bution, correspond to the L\ and L2 penalties respectively 
in the associated optimization-based approach. Although 
a model learned by L\ -penalized MLE is able to obtain a 
sparse structure, the full Bayesian treatment usually results 
in a fully connected model with many weak edges as ob- 



served in Park and Casella (2008 ), without special approx 



imate assumptions like the ones in Lin and Lee (2006 ). We 
propose to use the "spike and slab" prior to learn a sparse 
structure for MRFs in a fully Bayesian approach. The spike 



and slab prior (Mitchell and Beauchamp 1988, |Ishwaran 
|and Rao| |2005| ) is a mixture distribution which consists of 
a point mass at zero (spike) and a widely spread distribution 
(slab): 



P(0 o 



(l-po)S(9 a )+poAf(6 a ;0,*t) (2) 



where p £ [0, 1], 8 is the Dirac delta function, and ao is 
usually large enough to be uninformative. The spike com- 
ponent controls the sparsity of the structure in the poste- 
rior distribution while the slab component usually applies 
a mild shrinkage effect on the parameters of the existing 
edges even in a highly sparse model. This type of selective 
shrinkage is different from the global shrinkage imposed 
by L1/L2 regularization, and enjoys benefits in parameter 
estimation as demonstrated in the experiment section. 

The Bayesian MRF with the spike and slab prior is formu- 
lated as follows: 
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exp 



Q OL Y a A a 

Y a ~ Bern(po) Po ~ Beta(a, b) 
A a ~ Af(0, (Tq) <7 - 2 ~r(c,d) 



(3) 
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Figure 1: Illustration of the MCMC for A a and Y a 



where x is a set of state variables and a, b, c, and d are 
hyper-parameters. In the experiments we will use pairwise 
features in which case a = plus bias terms given by 
J2i Oififai) m the expression for the log-probability. We 
will use a normal prior (9j ~ A/"(0, of) for these bias terms. 
<7b is chosen to be large enough to act as an uninforma- 
tive prior. Y a is a binary random variable representing the 
existence of the edges in the clique x Q , and A a is the ac- 
tual value of the parameter 8 a when the edges are instan- 
tiated. It is easy to observe that given po and ern, 9 a has 
the same distribution as in equation [2] We use a hierarchi- 
cal model for 6 so that the inference will be insensitive to 
the choice of the hyper-parameters. In fact, experiments 
show that with a simple setting of the hyper-parameters, 
proper values of the sparsity parameter po and the variance 
(To are learned automatically by our model for all the data 
sets without the necessity of cross-validation. 

Unlike the optimization-based methods which estimate a 
single structure, the Bayesian approach expresses uncer- 
tainty about the existence of edges through its posterior 
distribution, P(Y\T>). We have applied a simple thresh- 
olding on P(Y a \D) for edge detection in the experiments 
although more sophisticated methods can conceivably give 
better results. 

Standard approaches to posterior inference do not work for 
Bayesian MRFs because it is intractable to compute the 
probability of a state x (due to the intractability of the par- 
tition function). We devised an approximate MCMC algo- 
rithm for inference, where we draw samples of the contin- 
uous variable A a by a modified Langevin dynamics algo- 
rithm, and samples of the discrete variable Y a jointly with 
A a by a reversible jump MCMC method, as illustrated in 
Figure [T]and explained in the following sections. 

3 Sampling Parameter Values by Langevin 
Dynamics 

Given Y, po, <tq, and an observed data set V = 
{x( m )}, m = 1 . . . N, the conditional distribution of pa- 
rameters {A a : Y a — 1} is the posterior distribution of an 
MRF with a fixed edge set induced by {a : Y a = 1} and an 
independent Gaussian prior 7V(0, CTq). We consider draw- 



ing samples of A a with fixed Y in this section and will use 
8 a and A a interchangeably to refer to a nonzero parameter. 
Even for an MRF model with a fixed structure, MCMC is 
still intractable. Approximate MCMC methods have been 
discussed in |Murray and Ghahramani (2004) among which 
Langevin Monte Carlo (LMC) with "brief sampling" to 
compute the required expectations in the gradients, shows 
good performance. (We lling and Teh||201 1) further shows 
that LMC with a noisy gradient can draw samples from the 
exact posterior distribution when the step size approaches 
zero. 

Langevin dynamics is described as the hybrid Monte Carlo 
(HMC) method with one leapfrog step in section 5.5.2 of 
|NeaH ( |20"T()] l: 

eC 



Pt+e/2 =Pt 



-g(o t ) 
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eC 
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t+s) 
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where p is the auxiliary momentum, e is the step size, C 
is a positive definite preconditioning matrix, and g is the 
gradient of the log-posterior probability log P(8\T>) Q A 
new value of p is drawn at every iteration from an isotropic 
Gaussian distribution JV(0,I) and then discarded after 9 
is updated. The leapfrog step is usually followed by a 
Metropolis-Hastings accept/reject step to ensure detailed 
balance. But since the rejection rate decays as e 3 , that step 
can be skipped for small step sizes without incurring much 
error. 

In a MRF, the gradient term g involves computing an ex- 
pectation over exponentially many states as 



N 
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The expectation is estimated by a set of s tate samples 
{x( s )} in the "brief Lan gevin" algorithm of Murray and 



[Ghahramani (2004 ), where these samples are drawn by run 
ning a few steps of Gibbs sampling initialized from a subset 
of the training data. We adopt the "brief Langevin" algo- 
rithm with three modifications for faster mixing. 

3.1 Persistent Gibbs Sampling 

We maintain a set of persistent Markov chains for the state 
samples {x^ s ^} by initializing Gibbs sampling at the last 
states of the previous iteration instead of the data. This 
is motivated by the persistent contrastive divergence algo- 



rithm of Tieleman (2008 ). When 6 changes slowly enough, 
the Gibbs sampler will approximately sample from the sta- 
tionary distribution, even when allowed a few steps at every 
iteration. 



'We omit all the other random variables that is conditioned 
on in this section for ease of notation. 



3.2 Preconditioning 

When the posterior distribution of {6 a } has different scales 
along different variables, the original LMC with a com- 
mon step size for all 6 a 's will traverse the parameter space 
slowly. We adopt a preconditioning matrix C to speed up 
the mixing, where C satisfies CC T = H with H is the 
Hessian matrix of log P(Omap\D )> computed as: 



Auto-correlation 



Auto-correlation 



H(0 MAP ) = Cov P(x |e MAp) /(x) + a Q 



(6) 



This is reminiscent of the observed Fisher information ma- 
except that we 



Girolami and Calderhead (2011 



tnx in 

use the MAP estimate with the prior. We approximate 
H(9map) by averaging over H(9 t ) during a burn-in pe- 
riod and estimate Covp( x |Q t )/ with the set of state samples 
from the persistent Markov chains. The adoption of a pre- 
conditioning matrix also helps us pick a common step size 
parameter e suitable for different training sets. 

3.3 Partial Momentum Refreshment 

The momentum term p in the leapfrog step represents the 
update direction of the parameter. Langevin dynamics is 
known to explore the parameter space through inefficient 
random walk behavior because it draws an independent 
sample for p at every iteration. We can achieve a bet- 
ter mixing rate with the partial momentum refreshment 
method proposed in Horowitz ( 1991| l. When p is updated 
at every step by: 



Pt <~ a Pt 



1lL t 



(7) 



where n t ~ jV(0,I), and a, f3 satisfy a 2 + fi 2 = 1, the 
momentum is partially preserved from the previous itera- 
tion and thereby suppresses the random-walk behavior in a 
similar fashion as HMC with multiple leapfrog steps. 

a controls how much momentum to be carried over from 
the previous iteration. With a large value of a, LMC re- 
duces the auto-correlation between samples significantly 
relative to LMC without partial momentum refreshment. 
The improved mixing rate is illustrated in Figure[2] We also 
show that the mean and standard deviation of the posterior 
distribution does not change. However, caution should be 
exercised especially when the step size r\ is large because a 
value of a that is too large would increase the error in the 
update equation which we do not correct with a Metropolis- 
Hastings step because that is intractable. 

4 Sampling Edges by Reversible Jump 
MCMC 

Langevin dynamics handles the continuous change in 
the parameter value A a . As for discrete changes in 
the model structure, Y a , we propose an approximate re- 
versible jump MCMC (RJMCMC) step ( |Green| [l995] l to 
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Figure 2: Comparison of Langevin dynamics on the block model 
in section |6.1| with partial momentum refreshment a = 0.9 
against a — 0. Step size -q — 10 -3 . Top: the auto-correlation 
of two typical parameters. Bottom: the posterior mean and stan- 
dard deviation of all parameters estimated with 10K samples. 



sample Y a and A a jointly from the conditional distri- 
bution P(A, Y|f>o, <Jq, T>). The proposed Markov chain 
adds/deletes one clique (or simply one edge when a = 
at a time. When an edge does not exist, i.e., Y a = 0, 
the variable A a can be excluded from the model, and 
therefore we consider the jump between a full model with 
Y a 7^ 0, A a = a and a degenerate model with Y a = 0. 

The proposed RJMCMC is as follows: when Y a = 0, 
propose adding an edge with probability P a( jd and sample 
A a = a from a proposal distribution q(A) with support on 
[— A Q , A Q ]; when Y a = 1 and \A a \ < A Q , propose delet- 
ing an edge with Pdei- The reason of restricting the pro- 
posed move within [— A Q , A Q ] will be explained later. It is 
easy to see that the Jacobian is 1. The jump is then accepted 
by the Metropolis-Hastings algorithm with a probability: 



Qadd =min{l,Q*(a)}, Q de i = min{l, 1/Q*(a)} 
Q*(a) = ex P (aE/ a (x^))( ^fl F " = 0) =a 



Z(Y a 

p N{A a = a|0,gg) Pdei 

(1-po) Paddq(A a = a) 



N 



(8) 



The factors in the first line of Q* represent the ratio of the 
model likelihoods while the other two are respectively the 
ratio of the prior distributions and the ratio of the proposal 
distributions. 

4.1 Unbiased Estimate to Q* and 1/Q* 

Computing the partition functions in equation [8] is gener- 
ally intractable. However, noticing that Z(Y a = 1, A a = 
a) — > Z(Y a = 0), as a — > 0, the log-ratio of the two parti- 
tion functions should be well approximated by a quadratic 
approximation at the origin when a is small. In this way 



we reduce the problem of estimating the partition function 
to computing the first and second order derivatives of the 
log-partition function. We employ a second order Taylor 
expansion for the ratio of partition functions in Q* as fol- 
lows: 
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We know that the kth order derivatives of the log-partition 
function of an MRF correspond to the fcth order centralized 
moments (or cumulants) of the features, that is, 



dlog(Z) 
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Given a set of n state samples x' s ) ~ P(x\Y a — 0) from 
the persistent Markov chains, we can compute an unbiased 
estimate of the mean and variance of f a as the sample mean 
f a and sample variance S a = E s (/a(x (s) )-/a) 2 /("-l) 
respectively. Consequently we obtain an estimate of R a dd 
by plugging f a and S a into equation [9J which is unbiased 
in the logarithmic domain, denoted as R ad d- 

An unbiased estimate in the logarithmic domain, unfortu- 
nately, is no longer unbiased once transformed to the linear 
domain. To correct the bias induced by the transformation, 
we take another Taylor expansion of R a dd/Radd around 
a = 0. After some derivation, we obtain an unbiased esti- 
mate of R a dd u P to me second order of a given by 



Radd(a) = exp -Naf 



with variance: 
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(11) 



Similarly, we can also obtain an unbiased estimate, Rdei, 
in 1/Q* when considering deleting an edge, with the same 
formula as R a dd except that a is replaced by —a and the 
sample mean and variance are now estimated with respect 
to P(x|y Q = 1, A a = a). If we plug in R add (or R de i) 
into Qadd (or Qdei) we get an unbiased estimate of the ac- 
ceptance probability except when Q* (or 1/Q*) is close to 
1 in which case the min operation causes extra bias. Since 
the variance can be computed as a function of a in equa- 
tion 1 1 we can estimate how large a jump range A can be 



used in order to keep the error in the acceptance probability 
negligible. A larger data set requires a smaller jump range 
or alternatively a larger sample set that grows quadratically 
with the size of the data set. 



4.2 Optimal Proposal Distribution q 

After plugging equations [9] [10] and [5] into equation [8] we 
obtain the acceptance probability as a function of a: 
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+ Var/ a + aNg a 



(12) 



where g a and Var/ Q are defined at 6 a = 0. Clearly, the 
optimal proposal distribution in terms of minimal variance 
is given by the following truncated Gaussian distribution 



qopt(A a = a\9) oc h(a), a e 



-A Q ,A Q ] 



(13) 



When adding an edge, we have state samples i.^ ~ 
P(x\Y a — 0). The expectation E/ Q in g a can be estimated 
by / Q ({x< s )}) and Varf Q by Sl({Z^}). When deleting 
an edge, the samples are from P(x\Y a — 1, A a = a). We 
apply a quadratic approximation for log Z(0 a ), use equa- 
tion 10 and estimate Var/ Q w S a and E/ Q m f a — aS a . 



4.3 Parallel Proposals 

Since the most computationally demanding step is to obtain 
a set of state samples {x^ s ^}, we want to reuse the samples 
whenever possible. Given that A Q is small enough, the pa- 
rameter value does not change much when we accept an 
"add or delete edge" move. We can thus assume that the 
distribution of A a on an edge is not affected much by an 
accepted move on other edges. As a result we do not have 
to rerun the Gibbs sampler after every edge operation, and 
we can propose jumps for all {a : \A a \ < A a } in parallel, 
using the same set of samples. This reduces the computa- 
tion time significantly. 

5 Sampling for Hyper-parameters 

Given A and Y, the hyper-parameters are easy to sample 
when using conjugate priors: 

Po \Y ~ Bcta(a + £ I(Y a = 1), b + I<Xa = 0)) 

a a 

a - 2 |Y,A~r( c +i]w(y Q = i),d+i Yl A D 

a a:Y a =l 

(14) 

where / is the indicator function. 

The whole inference algorithm is summarized in Alg [T] 

6 Experiments 
6.1 Datasets 

We assess the performance of our proposed method on two 
simulated data sets and one real-world data set. For the 



Algorithm 1 MCMC for Bayesian MRFs with Langevin 

Dynamics and RJMCMC 

(Parameters: number of iterations ITER, number of 
Gibbs sampling steps Ncibbs, sample size n, step size 
£, partial momentum refreshment a, RJMCMC proposal 
width A Q .) 

Initialize A, Y, and momentum p randomly 
for iter = l-> ITER do 

Sample po and <7o given A and Y 

Run Gibbs sampling for Ncibbs steps to draw 

{* (s) }"=i- 

Run LMC to draw {A a : Y a ^ 0}. 
Run RJMCMC to propose adding an edge for {a : 
Ya — 0}, and deleting an edge for {a : Y a = 
l,\A a \ < A a } 
end for 



simulated data, we randomly generate sparse Ising models 
with binary ±1 states and with parameters sampled from a 
Gaussian distribution J\f(0,a 2 ) where a = 0.5 for edges 
and 0.1 for node biases. These models are then converted 
to their equivalent Boltzmann machines with binary {0, 1} 
states from which we draw exact samples. Two structures 
are considered: (1) a block model with 12 nodes equally 
divided into 3 groups. Edges are added randomly within 
a group with a probability of 0.8, and across groups with 
0.1. Edges within a group are strong and positively cou- 
pled. There are 66 candidate edges with 20 edges in the 
ground truth model. (2) a 10 x 10 lattice with 4950 candi- 
date edges and with 180 edges in the ground truth model. 

For the real data, we use the MNIST digits image data. 
We convert the gray scale pixel values to binary values by 
thresholding at a value of 50, resize the images to a 14 x 14 
scale, and then pick a 9 x 12 patch centered in each im- 
age where the average value of each pixel is in the range 
of [10~ 4 , 1 — 10~ 4 ]. The last step is necessary because the 
other competing models do not have regularization on their 
biases, which will result in divergent parameter estimates 
for pixels that are always or 1. 

6.2 Model Specification 

We compare our Bayesian structure learning algorithm 



with two other approaches. One is proposed in|Wainwright 



et al. ( 2007 1 which recovers the neighbourhood of nodes 



by training separate L\ -regularized logistic regressions on 
each variable. While its goal is edge detection, we can 
also use it as a parameter estimation method with two out- 
put models, "Wain Max" and "Wain Min", as defined in 



Hofling and Tibshirani (2009 1. We implement the "Wain 



Max/Min" methods wit h the Lasso regularize d generalized 
linear model package of Friedman et al. (2010) j The other 



approach is one of the several variants of L\ -regularized 
MLE methods which use a pseud o-likelihood approxima- 
tion (Hofling and Tibshirani 2009 ), denoted as "MLE"[^] 



For our Bayesian model, we consider two schemes to spec- 
ify a model for prediction. One is the fully Bayesian ap- 
proach, referred as "Bayes", in which we random pick 100 
model samples in the Markov chain and approximate the 
Bayesian model by a mixture model of these 100 compo- 
nents. The other one is to obtain a single model by applying 
a threshold of 0.5 to P(Yij \T>) and estimate the posterior 
mean of the included edges, referred to as "Bayes PM". 

The performance of the Bayesian model is insensitive to 
the choice of hyper-parameters. We simply set a = b = 
c — d = 5 for po an d o"o> an d a b — 10 across all experi- 
ments. For the parameters of the MCMC method, we also 
use a common setting. We use a diagonal approximation 
to the feature covariance Covf and thereby the precondi- 
tioning matrix C. We set the sample size n = 100, num- 
ber of Gibbs sampling steps Naibbs = 1, LMC step size 
e = 10~ 3 , and momentum refreshment rate a = 0.9. The 
RJMCMC proposal width is set as A a = O.Ol/yOVVaf/ 
to achieve a small estimation error as in equation [TT] For 
each experiment, we run the MCMC algorithm to collect 
10K samples with a subsampling interval of 1000. Since 
the exact partition function can be computed on the small 
block model, we also run an exact MCMC, "Bayes Exact", 
with an exact gradient and accept/reject decision as well as 
larger values in s, a, and A than the approximate MCMC. 

Different levels of sparsity have to be considered in Li- 
based methods for an optimal regularization strength. For 
the Bayesian method, we learn a single sparsity level. How- 
ever, for the sake of comparison, we also consider a method 
with po as a parameter and vary it between (0, 1) to induce 
different sparsity, referred to with a suffix "po". 

6.3 Accuracy of Inference 

We first evaluate the validity of the proposed MCMC 
method on the block data where exact inference can be car- 
ried out. The marginal posterior distribution of an edge 
parameter, 8ij, is a mixture of a point mass at zero and a 
continuous component with a single mode. Figure[3]shows 
the histogram of samples in the continuous component of 
four randomly picked #i,/s. The title above each plot is 
the posterior probability of the edge being present 

in the model, i.e. 7^ or Yij = 1, In this figure, 
the marginal distribution estimated from the approximate 
MCMC method matches the distribution from "Bayes Ex- 
act" very well. For a more comprehensive comparison, we 
run "Bayes po" and "Bayes Exact po" methods, and vary 
the value of po from 10~ 4 to 0.99 to achieve different lev- 
els of sparsity. We estimate and collect across different po 
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Figure 3: Histogram of the parameter samples of four randomly 
picked edges from "Bayes" and "Bayes Exact" when Yij = 1. 
The training data is from the block model with N = 100. 
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Figure 4: Comparison of "Bayes po" and "Bayes Exact po" on 
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values the posterior probability of 9i j 7^ 0, posterior mean 
and standard deviation of 9i,j in the continuous component, 
as shown in Figure [4] Each point represents a parameter 
under some value of pq. We find the approximate MCMC 
procedure produces about the same values for these three 
statistics as the exact MCMC method. 

6.4 Simulated Data 

We then compare the performance of various methods on 
simulated data sets for two tasks: structure recovery and 
model estimation. The accuracy of recovering the true 
structure is measured by the precision and recall of the true 
edges. The quality of the estimated models is evaluated 
by the predictive performance on a held-out validation set. 
Since computing the log-likelihood of a MRF is in general 
intractable, we use the conditional log-likelihood (CLL) on 
a group of variables instead, which is a generalization of 



the conditional marginal log-likelihood in Lee et al. (2006). 
For each data case in the validation set, we randomly 
choose a group of variables, which is all the variables in 
the block model and a 3 x 3 grid in the lattice and MNIST 
models, and compute \ogP({x i \ iegroup \{x j }, J( f group ). 

We train models on the simulated data ranging between 50 
and 1000 items. For the Bayesian models, we remove an 
edge if the posterior probability P(X%,i = < 0.5. 
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Figure 5: Precision-Recall Curves for the lattice data with 100 
data cases. 



Figure [5] shows typical precision-recall curves for different 
methods. It turns out that all the models with a sparsity 
tuning parameter perform similarly across all the training 
sets. The "Bayes" model tends to find a structure with high 
precision. 

We then consider the average CLL as a function of the edge 
density of a model. The edge density is defined as the per- 
centage of edges present in a model, i.e., 1 — sparsity. For 
the fully Bayesian model, we measure the average density 
in the Markov chain. In fact, the variance of the edge den- 
sity is usually small, suggesting that most model samples 
have about the same number of edges. We show the av- 
erage CLL of different models trained with 100 data cases 
in Figure [6] The results on other data sizes are omitted 
because they have the same tendency. All the Bayesian 
models give robust prediction performance in the sparse 
and dense model ranges and "Bayes po" outperforms all 
the other methods with a tunable sparsity parameter for al- 
most all settings of pq. Moreover, the curve of "Bayes po" 
peaks at the true density level. In contrast, L\ methods 
would underfit to the data for sparse models or overfit for 
dense models. "MLE" performs better than "Wain" mod- 
els. This makes sense as the "Wain" models were not de- 
signed for MRF parameter estimation. Figure [7] compares 
the Bayesian models with exact and approximate inference. 
Again, the approximate MCMC method generates about 
the same posterior distribution as the "exact" method. 

The difference between Bayesian models and ii-based 
models could be partially explained by the different 
prior/regularization. As shown in Figure [8] to achieve a 
sparse structure, we have to use strong regularization in 
the L\ models which causes global shrinkage for all pa- 
rameters, resulting in under-fitting. On the other hand, to 
obtain a dense structure, weak regularization must be ap- 
plied globally which leads to over- fitting. In contrast, with 
a spike and slab prior, the parameter value of existing edges 
is not affected directly by po. Instead, their variance is con- 
trolled by another random variable 00 which fits the data 
automatically with a weak hierarchical prior. The behavior 
of selective shrinkage in the spike and slab prior is also dis- 
cussed in |Mohamed et al.| ( |20TTT ); |Ishwaran and Raolfl2005] l. 
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: Mean and standard deviation of average CLL at different density levels for block 
lattice (right) model with 100 training data cases. "Wain Min" is not plotted as it 
inferior to "Wain Max". Vertical line: true edge density. 



Figure 7: Average CLL at differ- 
ent density levels for the block model 
with 100 training cases. 
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Figure 8: Absolute value of edge parameters in the true model, 
"Wain Max", and "Bayes PM" for lattice data with 100 samples. 
Parameters are plotted only if they are not zero in at least one 
model and sorted along the x-axis by the absolute value in the 
true model. The parameters in "Wain Max" are too small in sparse 
models (top, edge density ~ 3%) and too large in dense models 
(bottom, 65%). Inset: zoom-in at the right-hand side. "Wain Min" 
and "MLE" are similar to "Wain Max". 

The Bayesian models, "Bayes po" and "Bayes PM po" do 
however show an interesting "under-fitting" phenomenon 
at a large density levels. This is due to a misspecified value 
for po. Since the standard deviation ern. is shared by all the 
Aij's whose Yi j = 1, when we fix p at an improperly 
large value, it forces a lot of non-existing edges to be in- 
cluded in the model, which consequently brings down the 
posterior distribution of <tq. This results in too small values 
on real edges as shown in the inset of Figure [8] and thereby 
a decrease in the model predictive accuracy. 

However, this "misbehavior" in return just suggests the 
ability of our Bayesian model to learn the true structure. 
Once we release po through a hierarchical prior, the model 
will abandon these improper values in po and automati- 
cally find a good structure and parameters. The vertical line 
in Figure [6] indicates the sparsity of the true model. Both 
"Bayes" and "Bayes PM" find sparsity levels very close to 
the true value, while L\ -based methods are under- fitted at 
that same level as shown in the upper panel of Figure[8] We 
show the joint performance of edge detection and parame- 
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Figure 9: CLL vs Fl score for block and lattice model with 100 
data cases. A good model is at the upper right corner. 



ter learning in Figure [9] where the performance of edge de- 
tection is summarized by the Fl score (the harmonic mean 
of the precision and recall). The Bayesian models with a hi- 
erarchical po prior achieve both a high F- 1 score and CLL 
value near the upper right comer. 

6.5 MNIST Data 

Since there does not exist ground truth in the model struc- 
ture of the MIST data set, we evaluate how well we can 
learn a sparse model for prediction. Figure [10] shows the 
average CLL on 10K test images with a model trained on 
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Figure 10: Mean and standard deviation of average CLL versus 
edge density on MNIST with 100 and 1000 data cases. 

100 and 1000 images respectively. In the sparse and dense 
model ranges, we observe again a better performance of 
"Bayes" than Li-based methods. "Bayes PM" also shows 
robustness to under/over-fitting although it seems that sim- 
ply computing the posterior mean does not provide suffi- 
ciently good model parameters in the median density range. 

To get a more intuitive comparison about the quality of 
learned sparse models, we train models on 1000 images by 
different methods with a density of 0.2 and then run Gibbs 
sampling to draw 36 samples from each model. The im- 

While it is hard to get good 



ages are shown in Figure 1 1 



reconstruction using a model without hidden variables, the 
Bayesian methods produce qualitatively better images than 
competing methods, even though "Bayes PM" does not 
have higher CLL than "MLE" at this level. 

A common limitation of learning Bayesian models with the 
MCMC inference method is that it is much slower than 
training a model with a point estimate. However, as shown 
in the experiments, the Bayesian methods are able to learn 
a good combination of parameters and a structure without 
the need to tune hyper-parameters through cross validation. 
Also, the Bayesian methods learn sparser models than L\- 
based methods without sacrificing predictive performance 
significantly. Because the computational complexity of 
inference grows exponentially with the maximum clique 
size of MRFs, L\ -based models at their optimal (not so 
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Figure 1 1 : Samples from learned models at an edge density level 
of 0.2. 

sparse) regularization level can in fact become significantly 
more computationally expensive than their Bayesian coun- 
terparts at prediction time. Turning up the regularization 
will result in sparser models but at the cost of under-fitting 
the data and thus sacrificing predictive accuracy. 

7 Discussion 

We propose Bayesian structure learning for MRFs with a 
spike and slab prior. An approximate MCMC method is 
proposed to achieve effective inference based on Langevin 
dynamics and reversible jump MCMC. As far as we known 
this is the first attempt to learn MRF structures in the 
fully Bayesian approach using spike and slab priors. Re- 



lated work was presented in Parise and Welling (2006) 
with a variational method for Bayesian MRF model selec- 
tion. However this method can only compare a given list of 
candidate models instead of searching in the exponentially 
large structure space. 

The proposed MCMC method is shown to provide accu- 
rate posterior distributions at small step sizes. The selective 
shrinkage property of the spike and slab prior enables us to 
learn an MRF at different sparsity levels without noticeably 
suffering from under-fitting or over-fitting even for a small 
data set. Experiments with simulated data and real-world 
data show that the Bayesian method can leam both an ac- 
curate structure and a set of parameter values with strong 
predictive performance. In contrast the Li-based methods 
could fail to accomplish both tasks with a single choice 
of the regularization strength. Also, the performance of 
our Bayesian model is largely insensitive to the choice of 
hyper-parameters. It provides an automated way to choose 
a proper sparsity level, while L\ methods usually rely on 
cross-validation to find their optimal regularization setting. 
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